Submitted by:
| # | Name | Id | Email |
|---|------|----|-------|
| Student 1 | Gil Zeevi | 203909320 | gil.zeevi@post.idc.ac.il |
| Student 2 | Joel Liurner | 346243579 | joel.liurner@post.idc.ac.il |
In this assignment you will:

Use your saved checkpoint to download the trained model back to your computer for inference and submission.
You can of course use any editor or IDE to work on these files. Keep the implementation itself (the model, losses, etc.) in .py files in the project dir, as in the HW assignments (hw1, hw2, etc.), while the things you want to present (the training logs, the graphs and all the explanations) should be in the corresponding notebook. Good luck with the project and with your exams!

In this part we will learn to generate new data using a special type of autoencoder model which allows us to sample from its latent space. We'll implement and train a VAE and use it to generate new images.
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile
import numpy as np
import torch
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda
Let's begin by downloading a dataset of images that we want to learn to generate. We'll use the Labeled Faces in the Wild (LFW) dataset which contains many labeled faces of famous individuals.
We're going to train our generative model to generate a specific face, not just any face. Since the person with the most images in this dataset is former president George W. Bush, we'll set out to train a Bush Generator :)
However, if you feel adventurous and/or prefer to generate something else, feel free to edit the PART2_CUSTOM_DATA_URL variable in hw4/answers.py.
import cs3600.plot as plot
import cs3600.download
from hw4.answers import PART2_CUSTOM_DATA_URL as CUSTOM_DATA_URL
DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
if CUSTOM_DATA_URL is None:
DATA_URL = 'http://vis-www.cs.umass.edu/lfw/lfw-bush.zip'
else:
DATA_URL = CUSTOM_DATA_URL
_, dataset_dir = cs3600.download.download_data(out_path=DATA_DIR, url=DATA_URL, extract=True, force=False)
File C:\Users\Gil zeevi\.pytorch-datasets\lfw-bush.zip exists, skipping download. Extracting C:\Users\Gil zeevi\.pytorch-datasets\lfw-bush.zip... Extracted 531 to C:\Users\Gil zeevi\.pytorch-datasets\lfw/George_W_Bush
Create a Dataset object that will load the extracted images:
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
im_size = 64
tf = T.Compose([
# Resize to constant spatial dimensions
T.Resize((im_size, im_size)),
# PIL.Image -> torch.Tensor
T.ToTensor(),
# Dynamic range [0,1] -> [-1, 1]
T.Normalize(mean=(.5,.5,.5), std=(.5,.5,.5)),
])
ds_gwb = ImageFolder(os.path.dirname(dataset_dir), tf)
C:\Users\Gil zeevi\anaconda3\lib\site-packages\torchvision\io\image.py:11: UserWarning: Failed to load image Python extension: Could not find module 'C:\Users\Gil zeevi\anaconda3\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.
warn(f"Failed to load image Python extension: {e}")
OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.
_ = plot.dataset_first_n(ds_gwb, 50, figsize=(15,10), nrows=5)
print(f'Found {len(ds_gwb)} images in dataset folder.')
Found 530 images in dataset folder.
x0, y0 = ds_gwb[0]
x0 = x0.unsqueeze(0).to(device)
print(x0.shape)
test.assertSequenceEqual(x0.shape, (1, 3, im_size, im_size))
torch.Size([1, 3, 64, 64])
An autoencoder is a model which learns a representation of data in an unsupervised fashion (i.e. without any labels). Recall its general form from the lecture:

An autoencoder maps an instance $\bb{x}$ to a latent-space representation $\bb{z}$. It has an encoder part, $\Phi_{\bb{\alpha}}(\bb{x})$ (a model with parameters $\bb{\alpha}$) and a decoder part, $\Psi_{\bb{\beta}}(\bb{z})$ (a model with parameters $\bb{\beta}$).
While autoencoders can learn useful representations, generally it's hard to use them as generative models because there's no distribution we can sample from in the latent space. In other words, we have no way to choose a point $\bb{z}$ in the latent space such that $\Psi(\bb{z})$ will end up on the data manifold in the instance space.

The variational autoencoder (VAE), first proposed by Kingma and Welling, addresses this issue by taking a probabilistic perspective. Briefly, a VAE model can be described as follows.
We define, in Bayesian terminology,
To create our variational decoder we'll further specify:
This setting allows us to generate a new instance $\bb{x}$ by sampling $\bb{z}$ from the multivariate normal distribution, obtaining the instance-space mean $\Psi _{\bb{\beta}}(\bb{z})$ using our decoder network, and then sampling $\bb{x}$ from $\mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$.
Our variational encoder will approximate the posterior with a parametric distribution $q _{\bb{\alpha}}(\bb{Z} | \bb{x}) = \mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$. The interpretation is that our encoder model, $\Phi_{\vec{\alpha}}(\bb{x})$, calculates the mean and variance of the posterior distribution, and samples $\bb{z}$ based on them. An important nuance here is that our network can't contain any stochastic elements that depend on the model parameters, otherwise we won't be able to back-propagate to those parameters. So sampling $\bb{z}$ from $\mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$ is not an option. The solution is to use what's known as the reparametrization trick: sample from an isotropic Gaussian, i.e. $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$ (which doesn't depend on trainable parameters), and calculate the latent representation as $\bb{z} = \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{u}\odot\bb{\sigma}_{\bb{\alpha}}(\bb{x})$.
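To make the trick concrete, here is a minimal, self-contained sketch (the tensor shapes are illustrative, not taken from the model above) showing that gradients flow through the sampled $\bb{z}$ back to the posterior parameters:

```python
import torch

torch.manual_seed(0)

# Hypothetical encoder outputs for a batch of 4 instances with z_dim = 2.
mu = torch.zeros(4, 2, requires_grad=True)          # posterior mean
log_sigma2 = torch.zeros(4, 2, requires_grad=True)  # posterior log-variance

# Reparametrization trick: sample u ~ N(0, I), which involves no trainable
# parameters, then shift and scale it deterministically so that
# z ~ N(mu, diag(sigma^2)).
u = torch.randn_like(mu)
z = mu + u * torch.exp(0.5 * log_sigma2)  # sigma = exp(log_sigma2 / 2)

# Because the randomness enters only through u, gradients propagate
# back to mu and log_sigma2.
z.sum().backward()
print(mu.grad.shape)
```

Had we sampled $\bb{z}$ directly from $\mathcal{N}(\bb{\mu}, \mathrm{diag}\{\bb{\sigma}^2\})$, `mu.grad` would not exist at all.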
To train a VAE model, we maximize the evidence distribution, $p(\bb{X})$ (see question below). The VAE loss can therefore be stated as minimizing $\mathcal{L} = -\mathbb{E}_{\bb{x}} \log p(\bb{X})$. Although this expectation is intractable, we can obtain a lower bound for $\log p(\bb{X})$ (the evidence lower bound, "ELBO", shown in the lecture):
$$ \log p(\bb{X}) \ge \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ \log p _{\bb{\beta}}(\bb{X} | \bb{z}) \right] - \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{X})\,\left\|\, p(\bb{Z} )\right.\right) $$where $ \mathcal{D} _{\mathrm{KL}}(q\left\|\right.p) = \mathbb{E}_{\bb{z}\sim q}\left[ \log \frac{q(\bb{Z})}{p(\bb{Z})} \right] $ is the Kullback-Leibler divergence, which can be interpreted as the information gained by using the posterior $q(\bb{Z|X})$ instead of the prior distribution $p(\bb{Z})$.
Using the ELBO, the VAE loss becomes, $$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ -\log p _{\bb{\beta}}(\bb{x} | \bb{z}) \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$
By remembering that the likelihood is a Gaussian distribution with a diagonal covariance and by applying the reparametrization trick, we can write the above as
$$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} } \left[ \frac{1}{2\sigma^2}\left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$
Obviously our model will have two parts, an encoder and a decoder. Since we're working with images, we'll implement both as deep convolutional networks, where the decoder is a "mirror image" of the encoder implemented with adjoint (AKA transposed) convolutions. Between the encoder CNN and the decoder CNN we'll implement the sampling from the parametric posterior approximator $q_{\bb{\alpha}}(\bb{Z}|\bb{x})$ to make it a VAE model and not just a regular autoencoder (of course, this is not yet enough to create a VAE, since we also need a special loss function which we'll get to later).
First let's implement just the CNN part of the Encoder network (this is not the full $\Phi_{\vec{\alpha}}(\bb{x})$ yet). As usual, it should take an input image and map it to an activation volume of a specified depth. We'll consider this volume as the features we extract from the input image. Later we'll use these to create the latent space representation of the input.
import hw4.autoencoder as autoencoder
in_channels = 3
out_channels = 1024
encoder_cnn = autoencoder.EncoderCNN(in_channels, out_channels).to(device)
print(encoder_cnn)
h = encoder_cnn(x0)
print(h.shape)
test.assertEqual(h.dim(), 4)
test.assertSequenceEqual(h.shape[0:2], (1, out_channels))
EncoderCNN(
(cnn): Sequential(
(0): Conv2d(3, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): Conv2d(256, 1024, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
)
)
torch.Size([1, 1024, 4, 4])
Now let's implement the CNN part of the Decoder.
Again this is not yet the full $\Psi _{\bb{\beta}}(\bb{z})$. It should take an activation volume produced
by your EncoderCNN and output an image of the same dimensions as the Encoder's input was.
This can be a CNN which is like a "mirror image" of the Encoder. For example, replace convolutions with transposed convolutions, downsampling with up-sampling etc.
Consult the documentation of ConvTranspose2D
to figure out how to reverse your convolutional layers in terms of input and output dimensions. Note that the decoder doesn't have to be exactly the opposite of the encoder and you can experiment with using a different architecture.
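As a quick sanity check of the dimension arithmetic (using the same kernel/stride/padding as the printed models below, but with illustrative channel counts), a `ConvTranspose2d` with `output_padding=1` exactly reverses the spatial downsampling of a stride-2 convolution:

```python
import torch
import torch.nn as nn

# A stride-2 conv halves the spatial dims: out = floor((64 + 2*2 - 5)/2) + 1 = 32.
conv = nn.Conv2d(3, 8, kernel_size=5, stride=2, padding=2)
# The transposed conv maps back: out = (32-1)*2 - 2*2 + 5 + 1 = 64.
deconv = nn.ConvTranspose2d(8, 3, kernel_size=5, stride=2, padding=2,
                            output_padding=1)

x = torch.randn(1, 3, 64, 64)
h = conv(x)
print(h.shape)   # torch.Size([1, 8, 32, 32])
xr = deconv(h)
print(xr.shape)  # torch.Size([1, 3, 64, 64])
```

Without `output_padding`, the transposed conv would land on 63 instead of 64, since two different input sizes can produce the same conv output size.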
TODO: Implement the DecoderCNN class in the hw4/autoencoder.py module.
decoder_cnn = autoencoder.DecoderCNN(in_channels=out_channels, out_channels=in_channels).to(device)
print(decoder_cnn)
x0r = decoder_cnn(h)
print(x0r.shape)
test.assertEqual(x0.shape, x0r.shape)
# Should look like colored noise
T.functional.to_pil_image(x0r[0].cpu().detach())
DecoderCNN(
(cnn): Sequential(
(0): ConvTranspose2d(1024, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): ConvTranspose2d(128, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(7): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): ConvTranspose2d(64, 3, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
)
)
torch.Size([1, 3, 64, 64])
Let's now implement the full VAE Encoder, $\Phi_{\vec{\alpha}}(\vec{x})$. It will work as follows: the EncoderCNN extracts a feature volume from the input image, which is flattened into a feature vector $\vec{h}$; two affine maps then compute the parameters of the posterior,
$$
\begin{align}
\bb{\mu} _{\bb{\alpha}}(\bb{x}) &= \vec{h}\mattr{W}_{\mathrm{h\mu}} + \vec{b}_{\mathrm{h\mu}} \\
\log\left(\bb{\sigma}^2_{\bb{\alpha}}(\bb{x})\right) &= \vec{h}\mattr{W}_{\mathrm{h\sigma^2}} + \vec{b}_{\mathrm{h\sigma^2}},
\end{align}
$$and the latent representation $\bb{z}$ is sampled from them using the reparametrization trick. Notice that we model the log of the variance, not the actual variance. The above formulation is proposed in appendix C of the VAE paper.
TODO: Implement the encode() method in the VAE class within the hw4/autoencoder.py module.
You'll also need to define your parameters in __init__().
z_dim = 2
vae = autoencoder.VAE(encoder_cnn, decoder_cnn, x0[0].size(), z_dim).to(device)
print(vae)
z, mu, log_sigma2 = vae.encode(x0)
test.assertSequenceEqual(z.shape, (1, z_dim))
test.assertTrue(z.shape == mu.shape == log_sigma2.shape)
print(f'mu(x0)={list(*mu.detach().cpu().numpy())}, sigma2(x0)={list(*torch.exp(log_sigma2).detach().cpu().numpy())}')
VAE(
(features_encoder): EncoderCNN(
(cnn): Sequential(
(0): Conv2d(3, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): Conv2d(256, 1024, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
)
)
(features_decoder): DecoderCNN(
(cnn): Sequential(
(0): ConvTranspose2d(1024, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): ConvTranspose2d(128, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(7): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): ConvTranspose2d(64, 3, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
)
)
(create_mean): Linear(in_features=16384, out_features=2, bias=True)
(create_logsigma): Linear(in_features=16384, out_features=2, bias=True)
(apply_reverse): Linear(in_features=2, out_features=16384, bias=True)
)
mu(x0)=[0.16524835, 0.014670414], sigma2(x0)=[0.96668094, 0.88182473]
Let's sample some 2d latent representations for an input image x0 and visualize them.
# Sample from q(Z|x)
N = 500
Z = torch.zeros(N, z_dim)
_, ax = plt.subplots()
with torch.no_grad():
for i in range(N):
Z[i], _, _ = vae.encode(x0)
ax.scatter(*Z[i].cpu().numpy())
# Should be close to the mu/sigma in the previous block above
print('sampled mu', torch.mean(Z, dim=0))
print('sampled sigma2', torch.var(Z, dim=0))
sampled mu tensor([ 0.1517, -0.0519])
sampled sigma2 tensor([1.0483, 0.7446])
Let's now implement the full VAE Decoder, $\Psi _{\bb{\beta}}(\bb{z})$. It will work as follows:
TODO: Implement the decode() method in the VAE class within the hw4/autoencoder.py module.
You'll also need to define your parameters in __init__(). You may need to also re-run the block above after you implement this.
x0r = vae.decode(z)
test.assertSequenceEqual(x0r.shape, x0.shape)
Our model's forward() function will simply return decode(encode(x)) as well as the calculated mean and log-variance of the posterior.
x0r, mu, log_sigma2 = vae(x0)
test.assertSequenceEqual(x0r.shape, x0.shape)
test.assertSequenceEqual(mu.shape, (1, z_dim))
test.assertSequenceEqual(log_sigma2.shape, (1, z_dim))
T.functional.to_pil_image(x0r[0].detach().cpu())
In practice, since we're using SGD, we'll drop the expectation over $\bb{X}$ and instead sample an instance from the training set and compute a point-wise loss. Similarly, we'll drop the expectation over $\bb{Z}$ by sampling from $q_{\vec{\alpha}}(\bb{Z}|\bb{x})$. Additionally, because the KL divergence is between two Gaussian distributions, there is a closed-form expression for it. These points bring us to the following point-wise loss:
$$ \ell(\vec{\alpha},\vec{\beta};\bb{x}) = \frac{1}{\sigma^2 d_x} \left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 + \mathrm{tr}\,\bb{\Sigma} _{\bb{\alpha}}(\bb{x}) + \|\bb{\mu} _{\bb{\alpha}}(\bb{x})\|^2 _2 - d_z - \log\det \bb{\Sigma} _{\bb{\alpha}}(\bb{x}), $$where $d_z$ is the dimension of the latent space, $d_x$ is the dimension of the input and $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$. This pointwise loss is the quantity that we'll compute and minimize with gradient descent. The first term corresponds to the data-reconstruction loss, while the second term corresponds to the KL-divergence loss. Note that the scaling by $d_x$ is not derived from the original loss formula and was added directly to the pointwise loss just to normalize the data term.
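The formula above can be sketched directly in code. The helper below is hypothetical: its batch-averaging conventions may differ from the graded `vae_loss` implementation, but each term mirrors the pointwise loss term by term:

```python
import torch

def vae_loss_sketch(x, xr, z_mu, z_log_sigma2, x_sigma2):
    # Hypothetical helper matching the pointwise loss above, averaged over
    # the batch; conventions may differ from the graded implementation.
    N = x.shape[0]
    dx = x[0].numel()   # instance dimension d_x
    dz = z_mu.shape[1]  # latent dimension d_z
    # Data term: squared reconstruction error scaled by 1 / (sigma^2 * d_x).
    data_loss = ((x - xr) ** 2).reshape(N, -1).sum(dim=1) / (x_sigma2 * dx)
    # KL term, closed form for KL( N(mu, diag(sigma^2)) || N(0, I) ):
    # tr(Sigma) + ||mu||^2 - d_z - log det(Sigma).
    kldiv_loss = (torch.exp(z_log_sigma2).sum(dim=1) + (z_mu ** 2).sum(dim=1)
                  - dz - z_log_sigma2.sum(dim=1))
    loss = (data_loss + kldiv_loss).mean()
    return loss, data_loss.mean(), kldiv_loss.mean()

# Sanity check: perfect reconstruction with a standard-normal posterior
# should give exactly zero loss under this formulation.
x = torch.randn(2, 3, 4, 4)
loss, _, _ = vae_loss_sketch(x, x.clone(), torch.zeros(2, 5),
                             torch.zeros(2, 5), 0.9)
print(loss)  # tensor(0.)
```

Note that since the network outputs $\log \bb{\sigma}^2$, both $\mathrm{tr}\,\bb{\Sigma}$ and $\log\det\bb{\Sigma}$ reduce to simple sums over the diagonal.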
TODO: Implement the vae_loss() function in the hw4/autoencoder.py module.
from hw4.autoencoder import vae_loss
torch.manual_seed(42)
def test_vae_loss():
# Test data
N, C, H, W = 10, 3, 64, 64
z_dim = 32
x = torch.randn(N, C, H, W)*2 - 1
xr = torch.randn(N, C, H, W)*2 - 1
z_mu = torch.randn(N, z_dim)
z_log_sigma2 = torch.randn(N, z_dim)
x_sigma2 = 0.9
loss, _, _ = vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2)
test.assertAlmostEqual(loss.item(), 58.3234367, delta=1e-3)
return loss
test_vae_loss()
tensor(58.3234)
The main advantage of a VAE is that it can be used as a generative model by sampling the latent space, since we optimize for an isotropic Gaussian prior $p(\bb{Z})$ in the loss function. Let's now implement this so that we can visualize how our model is doing when we train.
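Conceptually, sampling needs only the prior and the decoder. Here is a minimal sketch with a hypothetical stand-in decoder (the real model would use its learned `apply_reverse` Linear followed by the DecoderCNN instead):

```python
import torch
import torch.nn as nn

z_dim = 2
# Stand-in decoder: just maps a latent vector to a 3x64x64 "image".
decoder = nn.Sequential(nn.Linear(z_dim, 3 * 64 * 64),
                        nn.Unflatten(1, (3, 64, 64)))

def sample(n):
    # Sampling needs no encoder: draw z from the prior p(Z) = N(0, I)
    # and decode it back to instance space; no gradients are needed.
    with torch.no_grad():
        z = torch.randn(n, z_dim)
        return decoder(z)

print(sample(5).shape)  # torch.Size([5, 3, 64, 64])
```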
TODO: Implement the sample() method in the VAE class within the hw4/autoencoder.py module.
samples = vae.sample(5)
_ = plot.tensors_as_images(samples)
Time to train!
TODO:
- Implement the VAETrainer class in the hw4/training.py module. Make sure to implement the checkpoints feature of the Trainer class if you haven't done so already in Part 1.
- Implement the part2_vae_hyperparams() function within the hw4/answers.py module.
import torch.optim as optim
from torch.utils.data import random_split
from torch.utils.data import DataLoader
from torch.nn import DataParallel
from hw4.training import VAETrainer
from hw4.answers import part2_vae_hyperparams
torch.manual_seed(42)
# Hyperparams
hp = part2_vae_hyperparams()
batch_size = hp['batch_size']
h_dim = hp['h_dim']
z_dim = hp['z_dim']
x_sigma2 = hp['x_sigma2']
learn_rate = hp['learn_rate']
betas = hp['betas']
# Data
split_lengths = [int(len(ds_gwb)*0.9), int(len(ds_gwb)*0.1)]
ds_train, ds_test = random_split(ds_gwb, split_lengths)
dl_train = DataLoader(ds_train, batch_size, shuffle=True)
dl_test = DataLoader(ds_test, batch_size, shuffle=True)
im_size = ds_train[0][0].shape
# Model
encoder = autoencoder.EncoderCNN(in_channels=im_size[0], out_channels=h_dim)
decoder = autoencoder.DecoderCNN(in_channels=h_dim, out_channels=im_size[0])
vae = autoencoder.VAE(encoder, decoder, im_size, z_dim)
vae_dp = DataParallel(vae).to(device)
# Optimizer
optimizer = optim.Adam(vae.parameters(), lr=learn_rate, betas=betas)
# Loss
def loss_fn(x, xr, z_mu, z_log_sigma2):
return autoencoder.vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2)
# Trainer
trainer = VAETrainer(vae_dp, loss_fn, optimizer, device)
checkpoint_file = 'checkpoints/vae'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
os.remove(f'{checkpoint_file}.pt')
# Show model and hypers
print(vae)
print(hp)
VAE(
(features_encoder): EncoderCNN(
(cnn): Sequential(
(0): Conv2d(3, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): Conv2d(256, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
)
)
(features_decoder): DecoderCNN(
(cnn): Sequential(
(0): ConvTranspose2d(128, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): ConvTranspose2d(128, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(7): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): ConvTranspose2d(64, 3, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
)
)
(create_mean): Linear(in_features=2048, out_features=32, bias=True)
(create_logsigma): Linear(in_features=2048, out_features=32, bias=True)
(apply_reverse): Linear(in_features=32, out_features=2048, bias=True)
)
{'batch_size': 53, 'h_dim': 128, 'z_dim': 32, 'x_sigma2': 0.002, 'learn_rate': 0.0001, 'betas': (0.9, 0.999)}
TODO: Run the training. Once you're satisfied with the results, rename your saved checkpoint file by adding the suffix _final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK. The images you get should be colorful, with different backgrounds and poses.
import IPython.display
def post_epoch_fn(epoch, train_result, test_result, verbose):
# Plot some samples if this is a verbose epoch
if verbose:
samples = vae.sample(n=5)
fig, _ = plot.tensors_as_images(samples, figsize=(6,2))
IPython.display.display(fig)
plt.close(fig)
if os.path.isfile(f'{checkpoint_file_final}.pt'):
print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
checkpoint_file = checkpoint_file_final
else:
res = trainer.fit(dl_train, dl_test,
num_epochs=200, early_stopping=20, print_every=10,
checkpoints=checkpoint_file,
post_epoch_fn=post_epoch_fn)
# Plot images from best model
saved_state = torch.load(f'{checkpoint_file}.pt', map_location=device)
vae_dp.load_state_dict(saved_state['model_state'])
print('*** Images Generated from best model:')
fig, _ = plot.tensors_as_images(vae_dp.module.sample(n=15), nrows=3, figsize=(6,6))
*** Loading final checkpoint file checkpoints/vae_final instead of training
*** Images Generated from best model:
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.
from cs3600.answers import display_answer
import hw4.answers
What does the $\sigma^2$ hyperparameter (x_sigma2 in the code) do? Explain the effect of low and high values.
display_answer(hw4.answers.part2_q1)
Your answer:
A parametric likelihood distribution was denoted as, $p _{\bb{\beta}}(\bb{X} | \bb{Z}=\bb{z}) = \mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$, where $\sigma^2$ represents the variance of the normal distribution.
In general, increasing $\sigma^2$ yields a wider normal distribution, with a broader and shorter curve; if the variance is small (most values occur very close to the mean), the curve is narrow and tall in the middle.
From this statistical reasoning we conclude that increasing $\sigma^2$ gives more 'variability' in sampling: samples will probably look different from the original dataset. On the other hand, a low value of $\sigma^2$ will yield samples similar to the ones in the dataset.
Furthermore, increasing $\sigma^2$ down-weights the data-reconstruction term of the loss $\rightarrow$ the reconstruction term becomes less significant and the variance of the generated data will be large. On the contrary, by decreasing $\sigma^2$ the generated data becomes more 'biased' towards the training data.
display_answer(hw4.answers.part2_q2)
Your answer:
1)
Reconstruction loss - drives the reconstructed output to be as close as possible to the original input, using the mean squared error. Minimizing it means we minimize the difference between the real observation $\bb{x}$ and the VAE output (encoded and then decoded), $\Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right)$.
KL divergence loss - pushes the approximate posterior over the latent space to be as close as possible to a known, informative distribution, here the standard normal (Gaussian) distribution.
2)
The VAE tries to estimate the evidence distribution $p(\bb{X})$, but this is a hard task, so instead we maximize a lower bound on $\log p(\bb{X})$:
$$ \log p(\bb{X}) \ge \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ \log p _{\bb{\beta}}(\bb{X} | \bb{z}) \right] - \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{X})\,\left\|\, p(\bb{Z} )\right.\right). $$
In the formulation of the VAE loss, why do we start by maximizing the evidence distribution, $p(\bb{X})$?
display_answer(hw4.answers.part2_q3)
Your answer:
We start by maximizing the evidence distribution, $p(\bb{X})$, because we don't have direct access to it. We estimate it in order to generate new data, and by doing so we tighten the lower bound
$$ \log p(\bb{X}) \ge \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ \log p _{\bb{\beta}}(\bb{X} | \bb{z}) \right] - \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{X})\,\left\|\, p(\bb{Z} )\right.\right). $$
In the VAE encoder, why do we model the log of the latent-space variance corresponding to an input, $\bb{\sigma}^2_{\bb{\alpha}}$, instead of directly modelling this variance?
display_answer(hw4.answers.part2_q4)
Your answer:
Taking the log not only simplifies the subsequent mathematical analysis by reducing multiplication and division to addition and subtraction, it also helps numerically: the product of a large number of small probabilities can easily underflow the numerical precision of the computer, and this is resolved by computing the sum of the log probabilities instead.
In addition, modelling $\log \bb{\sigma}^2_{\bb{\alpha}}$ lets the network output any real value while the variance $\bb{\sigma}^2_{\bb{\alpha}} = \exp(\log \bb{\sigma}^2_{\bb{\alpha}})$ remains positive by construction, and the log function is differentiable on its entire domain, which makes computing gradients easy.
In this part we will implement and train a generative adversarial network and apply it to the task of image generation.
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile
import numpy as np
import torch
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda
device = torch.device("cpu")
We'll use the same data as in Part 2.
But again, you can use a custom dataset, by editing the PART3_CUSTOM_DATA_URL variable in hw4/answers.py.
import cs3600.plot as plot
import cs3600.download
from hw4.answers import PART3_CUSTOM_DATA_URL as CUSTOM_DATA_URL
DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
if CUSTOM_DATA_URL is None:
DATA_URL = 'http://vis-www.cs.umass.edu/lfw/lfw-bush.zip'
else:
DATA_URL = CUSTOM_DATA_URL
_, dataset_dir = cs3600.download.download_data(out_path=DATA_DIR, url=DATA_URL, extract=True, force=False)
File C:\Users\Gil zeevi\.pytorch-datasets\lfw-bush.zip exists, skipping download. Extracting C:\Users\Gil zeevi\.pytorch-datasets\lfw-bush.zip... Extracted 531 to C:\Users\Gil zeevi\.pytorch-datasets\lfw/George_W_Bush
Create a Dataset object that will load the extracted images:
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
im_size = 64
tf = T.Compose([
# Resize to constant spatial dimensions
T.Resize((im_size, im_size)),
# PIL.Image -> torch.Tensor
T.ToTensor(),
# Dynamic range [0,1] -> [-1, 1]
T.Normalize(mean=(.5,.5,.5), std=(.5,.5,.5)),
])
ds_gwb = ImageFolder(os.path.dirname(dataset_dir), tf)
C:\Users\Gil zeevi\anaconda3\lib\site-packages\torchvision\io\image.py:11: UserWarning: Failed to load image Python extension: Could not find module 'C:\Users\Gil zeevi\anaconda3\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.
warn(f"Failed to load image Python extension: {e}")
OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.
_ = plot.dataset_first_n(ds_gwb, 50, figsize=(15,10), nrows=5)
print(f'Found {len(ds_gwb)} images in dataset folder.')
Found 530 images in dataset folder.
x0, y0 = ds_gwb[0]
x0 = x0.unsqueeze(0).to(device)
print(x0.shape)
test.assertSequenceEqual(x0.shape, (1, 3, im_size, im_size))
torch.Size([1, 3, 64, 64])
GANs, first proposed in a paper by Ian Goodfellow in 2014, are today arguably the most popular type of generative model. GANs are currently producing state-of-the-art results in generative tasks over many different domains.
In a GAN model, two different neural networks compete against each other: A generator and a discriminator.
The Generator, which we'll denote as $\Psi _{\bb{\gamma}} : \mathcal{U} \rightarrow \mathcal{X}$, maps a latent-space variable $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$ to an instance-space variable $\bb{x}$ (e.g. an image). Thus a parametric evidence distribution $p_{\bb{\gamma}}(\bb{X})$ is generated, which we typically would like to be as close as possible to the real evidence distribution, $p(\bb{X})$.
The Discriminator, $\Delta _{\bb{\delta}} : \mathcal{X} \rightarrow [0,1]$, is a network which, given an instance-space variable $\bb{x}$, returns the probability that $\bb{x}$ is real, i.e. that $\bb{x}$ was sampled from $p(\bb{X})$ and not $p_{\bb{\gamma}}(\bb{X})$.

The generator is trained to generate "fake" instances which will maximally fool the discriminator into returning that they're real. Mathematically, the generator's parameters $\bb{\gamma}$ should be chosen such as to maximize the expression $$ \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
The discriminator is trained to classify between real images, coming from the training set, and fake images generated by the generator. Mathematically, the discriminator's parameters $\bb{\delta}$ should be chosen such as to maximize the expression $$ \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
These two competing objectives can thus be expressed as the following min-max optimization: $$ \min _{\bb{\gamma}} \max _{\bb{\delta}} \, \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
A key insight into GANs is that we can interpret the above maximum as the loss with respect to $\bb{\gamma}$:
$$ L({\bb{\gamma}}) = \max _{\bb{\delta}} \, \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$This means that the generator's loss function trains together with the generator itself in an adversarial manner. In contrast, when training our VAE we used a fixed L2 norm as a data loss term.
We'll now implement a Deep Convolutional GAN (DCGAN) model. See the DCGAN paper for architecture ideas and tips for training.
TODO: Implement the Discriminator class in the hw4/gan.py module.
If you wish you can reuse the EncoderCNN class from the VAE model as the first part of the Discriminator.
import hw4.gan as gan
dsc = gan.Discriminator(in_size=x0[0].shape).to(device)
print(dsc)
d0 = dsc(x0)
print(d0.shape)
test.assertSequenceEqual(d0.shape, (1,1))
Discriminator(
(disc_cnn): Sequential(
(0): Conv2d(3, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
)
(disc_fc): Linear(in_features=16384, out_features=1, bias=True)
)
torch.Size([1, 1])
TODO: Implement the Generator class in the hw4/gan.py module.
If you wish you can reuse the DecoderCNN class from the VAE model as the last part of the Generator.
z_dim = 128
gen = gan.Generator(z_dim, 4).to(device)
print(gen)
z = torch.randn(1, z_dim).to(device)
xr = gen(z)
print(xr.shape)
test.assertSequenceEqual(x0.shape, xr.shape)
Generator(
(gen_fc): Linear(in_features=128, out_features=16384, bias=False)
(gen_cnn): Sequential(
(0): ConvTranspose2d(1024, 512, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): LeakyReLU(negative_slope=0.2)
(3): ConvTranspose2d(512, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): LeakyReLU(negative_slope=0.2)
(6): ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): LeakyReLU(negative_slope=0.2)
(9): ConvTranspose2d(128, 3, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(10): Tanh()
)
)
torch.Size([1, 3, 64, 64])
Let's begin with the discriminator's loss function. Based on the above we can flip the sign and say we want to update the discriminator's parameters $\bb{\delta}$ so that they minimize the expression $$ -\mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, - \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
We're using the Discriminator twice in this expression; once to classify data from the real data distribution and once again to classify generated data. Therefore our loss should be computed based on these two terms. Notice that since the discriminator returns a probability, we can formulate the above as two cross-entropy losses.
GANs are notoriously difficult to train. One common trick for improving GAN stability during training is to make the classification labels noisy for the discriminator. This can be seen as a form of regularization, which helps prevent the discriminator from overfitting.
We'll incorporate this idea into our loss function. Instead of labels being equal to 0 or 1, we'll make them "fuzzy", i.e. random numbers in the ranges $[0\pm\epsilon]$ and $[1\pm\epsilon]$.
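As an illustration, a loss with such fuzzy labels might be sketched as follows. This is only a sketch, not the graded hw4/gan.py implementation: the function name, the assumption that label_noise is the full width of the noise interval, and the use of raw scores (logits) with the numerically stable BCE are all choices made here for illustration.

```python
import torch
import torch.nn.functional as F

def noisy_label_bce_sketch(y_data, y_generated, data_label=1.0, label_noise=0.3):
    # Fuzzy targets: uniform noise in [label - eps, label + eps],
    # assuming label_noise is the full interval width (eps = label_noise / 2).
    eps = label_noise / 2
    real_targets = data_label + (torch.rand_like(y_data) * 2 - 1) * eps
    fake_targets = (1 - data_label) + (torch.rand_like(y_generated) * 2 - 1) * eps
    # The discriminator ends with a Linear layer, so its outputs are raw
    # scores; use the numerically stable BCE-with-logits for both terms.
    loss_data = F.binary_cross_entropy_with_logits(y_data, real_targets)
    loss_generated = F.binary_cross_entropy_with_logits(y_generated, fake_targets)
    return loss_data + loss_generated
```

With data_label=1, real samples get targets near 1 and generated samples get targets near 0, each jittered by the label noise.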
TODO: Implement the discriminator_loss_fn() function in the hw4/gan.py module.
from hw4.gan import discriminator_loss_fn
torch.manual_seed(42)
y_data = torch.rand(10) * 10
y_generated = torch.rand(10) * 10
loss = discriminator_loss_fn(y_data, y_generated, data_label=1, label_noise=0.3)
print(loss)
test.assertAlmostEqual(loss.item(), 6.4808731, delta=1e-5)
tensor(6.4809)
Similarly, the generator's parameters $\bb{\gamma}$ should minimize the expression $$ -\mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )) $$
which can also be seen as a cross-entropy term. This corresponds to "fooling" the discriminator; notice that the gradient of the loss w.r.t. $\bb{\gamma}$ using this expression also depends on $\bb{\delta}$.
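The generator term can be sketched the same way; again, the function name and the logits assumption are ours for illustration, not the graded implementation:

```python
import torch
import torch.nn.functional as F

def generator_loss_sketch(y_generated, data_label=1.0):
    # The generator wants the discriminator to assign the *data* label
    # to its samples, i.e. BCE of the fake scores against data_label.
    target = torch.full_like(y_generated, data_label)
    return F.binary_cross_entropy_with_logits(y_generated, target)
```

Notice that this loss decreases as the discriminator's scores on the fakes increase, which is exactly the "fooling" objective above.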
TODO: Implement the generator_loss_fn() function in the hw4/gan.py module.
from hw4.gan import generator_loss_fn
torch.manual_seed(42)
y_generated = torch.rand(20) * 10
loss = generator_loss_fn(y_generated, data_label=1)
print(loss)
test.assertAlmostEqual(loss.item(), 0.0222969, delta=1e-3)
tensor(0.0223)
Sampling from a GAN is straightforward, since it learns to generate data from an isotropic Gaussian latent space distribution.
There is an important nuance however. Sampling is required during the process of training the GAN, since we generate fake images to show the discriminator. As you'll see in the next section, in some cases we'll need our samples to have gradients (i.e., to be part of the Generator's computation graph).
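One way to implement this switch is to toggle torch's gradient mode around the generation. A minimal sketch, using a stand-in generator rather than the hw4 class:

```python
import torch
import torch.nn as nn

def sample_sketch(generator, n, z_dim, with_grad=False):
    # Disable autograd tracking unless the caller asks for gradients,
    # in which case the samples stay attached to the generator's graph.
    ctx = torch.enable_grad() if with_grad else torch.no_grad()
    with ctx:
        z = torch.randn(n, z_dim)
        return generator(z)

stand_in_gen = nn.Linear(4, 8)  # stands in for the DCGAN generator
detached = sample_sketch(stand_in_gen, 5, 4, with_grad=False)
attached = sample_sketch(stand_in_gen, 5, 4, with_grad=True)
print(detached.grad_fn is None, attached.grad_fn is None)  # True False
```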
TODO: Implement the sample() method in the Generator class within the hw4/gan.py module.
samples = gen.sample(5, with_grad=False)
test.assertSequenceEqual(samples.shape, (5, *x0.shape[1:]))
test.assertIsNone(samples.grad_fn)
_ = plot.tensors_as_images(samples.cpu())
samples = gen.sample(5, with_grad=True)
test.assertSequenceEqual(samples.shape, (5, *x0.shape[1:]))
test.assertIsNotNone(samples.grad_fn)
Training GANs is a bit different since we need to train two models simultaneously, each with its own separate loss function and optimizer. We'll implement the training logic as a function that handles one batch of data and updates both the discriminator and the generator based on it.
As mentioned above, GANs are considered hard to train. To get some ideas and tips you can see this paper, this list of "GAN hacks" or just do it the hard way :)
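The per-batch logic described above might be sketched as follows. This is a rough outline under our own naming, not the graded hw4/gan.py solution; it assumes the generator exposes the sample(n, with_grad) method from the previous section.

```python
import torch

def train_batch_sketch(dsc, gen, dsc_loss_fn, gen_loss_fn,
                       dsc_optimizer, gen_optimizer, x_data):
    # Discriminator step: score real data vs. fakes sampled WITHOUT
    # gradients, so the generator is held fixed while dsc updates.
    dsc_optimizer.zero_grad()
    fake = gen.sample(x_data.shape[0], with_grad=False)
    dsc_loss = dsc_loss_fn(dsc(x_data), dsc(fake))
    dsc_loss.backward()
    dsc_optimizer.step()

    # Generator step: sample WITH gradients so backprop flows through
    # the discriminator's decision back into the generator's weights.
    gen_optimizer.zero_grad()
    gen_loss = gen_loss_fn(dsc(gen.sample(x_data.shape[0], with_grad=True)))
    gen_loss.backward()
    gen_optimizer.step()

    return dsc_loss.item(), gen_loss.item()
```

The crucial detail is sampling without gradients for the discriminator step and with gradients for the generator step.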
TODO:
- Implement the train_batch function in the hw4/gan.py module.
- Set the hyperparameters in the part3_gan_hyperparams() function within the hw4/answers.py module.

import torch.optim as optim
from torch.utils.data import DataLoader
from hw4.answers import part3_gan_hyperparams
torch.manual_seed(42)
# Hyperparams
hp = part3_gan_hyperparams()
batch_size = hp['batch_size']
z_dim = hp['z_dim']
# Data
dl_train = DataLoader(ds_gwb, batch_size, shuffle=True)
im_size = ds_gwb[0][0].shape
# Model
dsc = gan.Discriminator(im_size).to(device)
gen = gan.Generator(z_dim, featuremap_size=4).to(device)
# Optimizer
def create_optimizer(model_params, opt_params):
    opt_params = opt_params.copy()
    optimizer_type = opt_params.pop('type')
    return optim.__dict__[optimizer_type](model_params, **opt_params)
dsc_optimizer = create_optimizer(dsc.parameters(), hp['discriminator_optimizer'])
gen_optimizer = create_optimizer(gen.parameters(), hp['generator_optimizer'])
# Loss
def dsc_loss_fn(y_data, y_generated):
    return gan.discriminator_loss_fn(y_data, y_generated, hp['data_label'], hp['label_noise'])

def gen_loss_fn(y_generated):
    return gan.generator_loss_fn(y_generated, hp['data_label'])
# Training
checkpoint_file = 'checkpoints/gan'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
    os.remove(f'{checkpoint_file}.pt')
# Show hypers
print(hp)
{'batch_size': 32, 'z_dim': 8, 'data_label': 1, 'label_noise': 0.1, 'discriminator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.5, 0.999)}, 'generator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.3, 0.999)}}
TODO:
- Implement the save_checkpoint function in the hw4.gan module. You can decide on your own criterion regarding whether to save a checkpoint at the end of each epoch.
- When you are satisfied with the trained model, rename your saved checkpoint file by adding the suffix _final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK.

import IPython.display
import tqdm
from hw4.gan import train_batch, save_checkpoint
num_epochs = 100
if os.path.isfile(f'{checkpoint_file_final}.pt'):
    print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
    num_epochs = 0
    gen = torch.load(f'{checkpoint_file_final}.pt', map_location=device)
    checkpoint_file = checkpoint_file_final
try:
    dsc_avg_losses, gen_avg_losses = [], []
    for epoch_idx in range(num_epochs):
        # We'll accumulate batch losses and show an average once per epoch.
        dsc_losses, gen_losses = [], []
        print(f'--- EPOCH {epoch_idx+1}/{num_epochs} ---')
        with tqdm.tqdm(total=len(dl_train.batch_sampler), file=sys.stdout) as pbar:
            for batch_idx, (x_data, _) in enumerate(dl_train):
                x_data = x_data.to(device)
                dsc_loss, gen_loss = train_batch(
                    dsc, gen,
                    dsc_loss_fn, gen_loss_fn,
                    dsc_optimizer, gen_optimizer,
                    x_data)
                dsc_losses.append(dsc_loss)
                gen_losses.append(gen_loss)
                pbar.update()
        dsc_avg_losses.append(np.mean(dsc_losses))
        gen_avg_losses.append(np.mean(gen_losses))
        print(f'Discriminator loss: {dsc_avg_losses[-1]}')
        print(f'Generator loss: {gen_avg_losses[-1]}')
        if save_checkpoint(gen, dsc_avg_losses, gen_avg_losses, checkpoint_file):
            print('Saved checkpoint.')
        samples = gen.sample(5, with_grad=False)
        fig, _ = plot.tensors_as_images(samples.cpu(), figsize=(6, 2))
        IPython.display.display(fig)
        plt.close(fig)
except KeyboardInterrupt:
    print('\n *** Training interrupted by user')
*** Loading final checkpoint file checkpoints/gan_final instead of training
# Plot images from best or last model
if os.path.isfile(f'{checkpoint_file}.pt'):
gen = torch.load(f'{checkpoint_file}.pt', map_location=device)
print('*** Images Generated from best model:')
samples = gen.sample(n=15, with_grad=False).cpu()
fig, _ = plot.tensors_as_images(samples, nrows=3, figsize=(6,6))
*** Images Generated from best model:
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.
from cs3600.answers import display_answer
import hw4.answers
Explain in detail why during training we sometimes need to maintain gradients when sampling from the GAN, and other times we don't. When are they maintained and why? When are they discarded and why?
display_answer(hw4.answers.part3_q1)
Your answer:
The generator's training step relies on sampling (generating) images, presenting them to the discriminator, and then calculating the loss on the discriminator's output.
In this step we do want the generator's weights to be updated, hence we do maintain gradients: the samples must be part of the generator's computation graph so back-propagation can reach it.
On the other hand, when training the discriminator, it simply behaves as a classifier,
and during that process we keep the generator fixed; we sample without maintaining gradients (detached from the generator's graph), since no update of the generator is needed,
thus letting the discriminator learn to classify the generator's current output as-is.
When training a GAN to generate images, should we decide to stop training solely based on the fact that the Generator loss is below some threshold? Why or why not?
What does it mean if the discriminator loss remains at a constant value while the generator loss decreases?
display_answer(hw4.answers.part3_q2)
Your answer:
1)
No, we shouldn't stop training based solely on a low generator loss!
The reason lies in the fact that we're trying to reach an equilibrium between the generator and the discriminator.
If the discriminator isn't accurate enough, a very low generator loss simply means that the generator is good at fooling this weak discriminator.
Hence, from a low generator loss alone we cannot conclude anything about the performance of the entire model, due to the dependency between the discriminator and generator.
2)
There are two possible interpretations:
a. If the discriminator loss is temporarily stuck, it might mean the discriminator is ahead of the generator in the learning process:
the discriminator can tell the difference between real and fake images, forcing the generator to keep learning in order to "catch up".
b. If the discriminator loss is permanently stuck and will not improve while the generator loss decreases:
the generator is improving, i.e. getting better at generating fake images, while the discriminator is not; its accuracy in discriminating fake images from real ones stays the same.
In that case the discriminator might be stuck in a local minimum.
Compare the results you got when generating images with the VAE to the GAN results. What's the main difference and what's causing it?
display_answer(hw4.answers.part3_q3)
Your answer:
The VAE model yielded blurrier images with less sharp edges, focusing mostly on the foreground of the images, while the GAN model yielded sharper images with more features and colors,
hence capturing the background as well.
The VAE has a reconstruction term in its overall loss function, which forces the output to be similar to the input by applying an MSE loss.
Averaging over plausible outputs in this way loses detail: it results in smooth, blurry images, and the generated images look more similar to each other.
The GAN's generator, on the other hand, does not have 'direct access' to real images,
but learns how they should look through the decisions of the discriminator, which pushes its outputs to be more realistic.
At the beginning of training the discriminator can easily spot what's fake and what's real; by the end of the learning process, hopefully,
the discriminator's predictions tend towards random guessing, i.e. it can hardly tell whether an image is fake or real, and the generator's output is correspondingly realistic.
This section contains summary questions about various topics from the course material.
You can add your answers in new cells below the questions.
Notes
Answer:
In a neural-network context, the receptive field is defined as the size of the region in the input that produces a given feature (Wikipedia).
In other words, it is the portion of the input needed in order to compute a specific feature that we are looking at, at any convolutional layer.
The receptive fields of different features partially overlap, and together they cover the entire input space.
When stacking convolutional layers, the receptive fields compose, and each feature takes input from a larger area of the previous layer's output.
As an intuition, it can be compared to our vision: receptive fields start as small portions of the input and grow as successive convolutions combine them, in order to make sense of the whole scene.
The receptive field size is affected by kernel size, stride, dilation and pooling.
Answer:
The growth of the receptive field from layer to layer depends on the following:
Pooling - reduces the dimension of the feature map by combining features in the same region. Subsequent convolution layers therefore see increasingly larger parts of the input image, which results in a rapid increase in receptive field size.
Stride - how far the filter moves in each step along one direction. It determines how much the receptive fields of neighboring features overlap: larger strides cause a smaller overlapping portion of pixels between features, and thus a faster growth of the receptive field between layers.
Dilation - by increasing this factor, the kernel weights are spaced farther apart at given intervals (i.e., the kernel becomes more sparse), and the effective kernel size increases accordingly. Therefore, by monotonically increasing the dilation factor through the layers, the receptive field can be effectively expanded without loss of resolution.
import torch
import torch.nn as nn
cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)
cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape
torch.Size([1, 32, 122, 122])
What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?
Answer:
Each layer's receptive field is derived from the layer before it, hence we can calculate the network's receptive field recursively.
The recursive formula for the receptive field size of the output tensor: $$ r_k = r_{k-1} + (g_k-1)\cdot \prod_{i=1}^{k-1}s_i$$
$r_k$ - receptive field at layer k
$g_k$ - kernel size for layer k
$s_k$ - stride at layer k
Applying the formula to the network above, with the dilated kernel entering at its effective size $2\cdot(7-1)+1=13$, gives a receptive field of $112\times112$ input pixels for each output "pixel".
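The recursive formula can be applied to the network from the previous cell in a few lines (the layer list is written out by hand here; the dilated convolution enters with its effective kernel size $d\cdot(k-1)+1$):

```python
# (kernel, stride) per layer of the cnn above; dilation folded into the kernel.
layers = [
    (3, 1),   # Conv2d(k=3, s=1)
    (2, 2),   # MaxPool2d(2)
    (5, 2),   # Conv2d(k=5, s=2)
    (2, 2),   # MaxPool2d(2)
    (13, 1),  # Conv2d(k=7, dilation=2) -> effective kernel 2*(7-1)+1 = 13
]
r, jump = 1, 1  # receptive field and cumulative stride (product of s_i)
for k, s in layers:
    r += (k - 1) * jump
    jump *= s
print(r)  # 112
```

So each "pixel" of the (1, 32, 122, 122) output sees a 112x112 region of the input (clipped at the image borders).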
You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).
After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead, $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.
However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.
Answer:
The main reason that the original and residual networks produce different filters lies in the fact that the filters of a residual layer learn the residual, i.e. the difference between the layer's output and its input, as can be seen by rearranging the given formula: $$f_l(\vec{x};\vec{\theta}_l)=\vec{y}_l-\vec{x}$$
import torch.nn as nn
p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)
Sequential( (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (1): ReLU() (2): Dropout(p=0.1, inplace=False) (3): Dropout(p=0.2, inplace=False) )
If we want to replace the two consecutive dropout layers with a single one defined as follows:
nn.Dropout(p=q)
what would the value of q need to be? Write an expression for q in terms of p1 and p2.
Answer:
A unit survives two consecutive dropout layers only if it is kept by both, which happens with probability $(1-p_1)(1-p_2)$. A single equivalent dropout layer must keep units with the same probability, so $$1-q = (1-p_1)(1-p_2) \;\Rightarrow\; q = p_1 + p_2 - p_1 p_2$$
Simplified in words: a unit is dropped if it is dropped by either of the two layers.
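A quick numeric check of the equivalent rate for the values in the cell above, under the convention that p is the probability of dropping a unit:

```python
# Two independent dropouts keep a unit with probability (1 - p1) * (1 - p2);
# the single equivalent dropout must keep it with the same probability.
p1, p2 = 0.1, 0.2
keep = (1 - p1) * (1 - p2)  # probability a unit survives both layers
q = 1 - keep                # = p1 + p2 - p1*p2
print(q)  # ~0.28
```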
Answer:
False
Usually dropout is applied after the activation function, but nevertheless, and in particular when using ReLU, dropout can be applied before the activation,
where it is even more computationally efficient (zeroing-and-scaling commutes with ReLU, since ReLU(c·x) = c·ReLU(x) for c ≥ 0).
Answer:
Let $x$ denote an activation, with expectation $\mathbb{E}[x]$, and let $p$ be the dropout probability.
Under dropout, $\hat{x} = m\cdot x$, where $m\sim\mathrm{Bernoulli}(1-p)$ is an independent keep mask.
The expectation of the activation under dropout is then: $$\mathbb{E}[\hat{x}] = \mathbb{E}[m\cdot x]$$
By the properties of expectation and the independence of $m$ and $x$, we get: $$\mathbb{E}[\hat{x}] = \mathbb{E}[m]\cdot\mathbb{E}[x] = (1-p)\cdot \mathbb{E}[x]$$
$$\downarrow$$
$$\mathbb{E}[\hat{x}] = (1-p)\cdot \mathbb{E}[x]$$
Now it is easy to see that, in order to keep the value of each activation unchanged in expectation, we need to scale the kept activations by $1/(1-p)$.
Q.E.D
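A small Monte-Carlo check of this scaling (pure Python; the activation is a constant x = 1, so E[x] = 1):

```python
import random

random.seed(0)
p = 0.25      # dropout probability
n = 200_000   # number of simulated units
# Inverted dropout: kept units are scaled by 1/(1-p); dropped units contribute 0.
total = sum(1.0 / (1 - p) for _ in range(n) if random.random() > p)
mean_hat = total / n
print(mean_hat)  # ~1.0, matching E[x] up to sampling noise
```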
Answer:
An L2 loss fits regression tasks, whereas here we have a classification task. Hence binary cross-entropy will do the trick: it penalizes the model strongly in cases of uncertainty, forcing the model to keep learning until it eventually predicts confidently and correctly.
Let's demonstrate how BCE penalizes a case of uncertainty more strongly than L2:
Suppose $p_{dog} = 0.45$ (with "dog" being the true class), which is quite an uncertain classification score, almost a 50-50 guess. Let's see which loss penalizes it more:
$$L_2(p_{dog} = 0.45)= (1-0.45)^2 = 0.3025 \\ L_{BCE}(p_{dog} = 0.45)= -\log(0.45) = 0.798 $$
We see that $L_{BCE} > L_2$: BCE penalizes the uncertain prediction more strongly, and the higher loss pushes the model to keep improving where a lower loss would not.
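The two numbers can be checked directly:

```python
import math

# Comparing the two penalties for the uncertain prediction p_dog = 0.45
# (with "dog" the true class):
p = 0.45
l2 = (1 - p) ** 2    # squared error: 0.3025
bce = -math.log(p)   # binary cross-entropy: ~0.7985
print(l2, bce, bce > l2)
```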

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in N locations around the globe.
You define your model as follows:
import torch.nn as nn
N = 42 # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ] * 24,
    nn.Linear(in_features=H, out_features=1),
)
While training your model you notice that the loss reaches a plateau after only a few iterations. It seems that your model is no longer training. What is the most likely cause?
Answer:
The chosen architecture seems to be deeper than necessary, without any batch normalizations and skip connections.
Skip connections in deep architectures, as the name suggests, skip some layer in the neural network and feeds the output of one layer as the input to the next layers.
when all of the mentioned above isn't applied, we can witness a phenomenon called 'vanishing gradients.
The gradient becomes very small as we approach the earlier layers in a deep architecture. In some cases, the gradient becomes zero, meaning that we do not update the early layers at all, hence evntually the model stops its training.
A friend suggests that if you replace the sigmoid activations with tanh, it will solve your problem. Is he correct? Explain why or why not.
Answer:
The tanh gradient also approaches zero quickly as the input moves away from zero, just like the sigmoid; furthermore, the tanh derivative, $\mathrm{sech}^2$, is also bounded on (0,1], so this activation change won't make any significant difference.
ReLU, on the other hand, stays linear (derivative 1) over the positive inputs where sigmoid and tanh saturate, and thus deals much better with vanishing gradients.
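A quick numeric comparison of the derivatives at a moderately large input makes the point (x = 3 chosen arbitrarily):

```python
import math

# Activation derivatives at x = 3:
x = 3.0
sig = 1 / (1 + math.exp(-x))
d_sigmoid = sig * (1 - sig)      # ~0.045 (bounded above by 1/4)
d_tanh = 1 - math.tanh(x) ** 2   # ~0.0099 (bounded above by 1)
d_relu = 1.0                     # constant for any x > 0
print(d_sigmoid, d_tanh, d_relu)
```

Stacking 25 sigmoid or tanh layers multiplies many such small factors together, while ReLU contributes a factor of 1 wherever its input is positive.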
Answer:
4.1. False - Activation functions are not the only possible cause of vanishing gradients; they can also result from a network that is too deep and lacks skip connections.
4.2. False - The gradient of ReLU is constant (equal to 1) whenever the input is positive.
4.3. True - Feeding negative inputs into ReLU produces an output of 0 and a gradient of 0 as well; the weights feeding that neuron then stop updating, causing a "dead neuron". This is where LeakyReLU comes in handy, as it is designed to deal with exactly these cases.
Answer:
The difference between the optimizers above lies in the number of samples used for each update:
GD uses all the training samples in each update, SGD uses one random sample per update, and mini-batch SGD uses a small fixed-size 'batch' of samples in each update.
While for very large datasets GD is too expensive, or even infeasible in terms of computation time and memory, SGD is more practical and quicker to converge to a local minimum. Mini-batch SGD improves on single-sample SGD, whose updates are rather too noisy.
Answer:
2.1.
i. Slow training: in GD, each gradient update can take a lot of time, since it is computed over all the training samples in every iteration.
ii. SGD may generalize better, since the randomly selected samples add some noise to each update, while GD may overfit due to always training on the full data.
2.2.
i. When the dataset is very large, and/or we know in advance that we have a low-memory machine: using the full dataset per update will train very slowly in the best case, or run out of memory in the worst case.
Answer:
The number of iterations will surely decrease, since each update is computed from more samples, so the gradient direction is more accurate and less noisy, resulting in a larger loss reduction per iteration.
Nevertheless, even if fewer iterations are now needed for convergence, it doesn't mean that total training time will be shorter; on the contrary, it will probably take longer, since each iteration is more expensive.
Answer:
4.1. True - We perform an optimization step for each sample (SGD) or for each small batch of samples (mini-batch SGD) in every epoch.
4.2. False - SGD uses only one sample, so the variance of the updates is higher in that case. This behaviour can lead to faster, but less stable, convergence.
4.3. True - Because each update uses newly drawn random samples, the noise can push the gradient out of local minima towards the global minimum, unlike classical GD, which tends to converge securely and stably into the nearest local minimum.
4.4. False - As already stated above, GD consumes more memory, since it processes all the data samples at once instead of a smaller batch as in SGD.
4.5. False - GD has a bigger chance of getting stuck in local minima than SGD, because computing the gradient over all training samples yields a stable, consistent direction.
The SGD gradient is less stable and its direction varies between iterations due to the different 'chunk' of training samples used each time, giving more chances to escape a local minimum when one is encountered.
4.6. True - Newton's method uses second-order derivatives, which are computationally more expensive than first-order SGD and momentum. Furthermore, Newton's method tends to get stuck at saddle points, which can be located in narrow ravine-like surfaces, whereas SGD with momentum damps the oscillations in a narrow ravine that vanilla SGD suffers from.
Answer:
6.1.
Vanishing gradients - the gradients shrink as they propagate backwards through the network, until at some stage they are so small that they are effectively 0.
Exploding gradients - the gradients grow as they propagate backwards through the network, until they become so large that the update steps overshoot and the optimizer can't find a minimum.
To sum up both phenomena: due to the repeated multiplication of the chain rule, large gradients get drastically amplified and small gradients get drastically diminished.
6.2.
As written above, when propagating the gradients, each layer's gradient is multiplied by the gradients of the layers after it, and so on. Increasing the number of layers increases the number of multiplications, turning large factors into even larger products and small factors into even smaller ones. With a very deep network, gradients that are even slightly too large or too small rapidly blow up to infinity or decay to 0, respectively.
6.3.
Assume a fairly deep network with $n$ layers, where each layer multiplies its input by a weight $w$ and then applies the activation, and assume all weights throughout the network have roughly the same magnitude, and in our particular example are exactly equal: $$w = 0.5$$
With $n=10$ layers, the accumulated factor is $0.5^{10} \approx 0.001$, which tends to 0 as more layers are added - the vanishing-gradients regime.
If we instead take a large weight, say $w=5$, after 10 layers the factor is $5^{10} = 9765625$, which keeps growing as the network gets deeper - the exploding-gradients regime.
6.4.
It can be diagnosed by looking at the loss function's values and curve: a loss that plateaus very early, barely changing between iterations, suggests vanishing gradients, while a loss that oscillates wildly or diverges to NaN suggests exploding gradients.
You wish to train the following 2-layer MLP for a binary classification task: $$ \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2 $$ Your wish to minimize the in-sample loss function is defined as $$ L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right) $$ Where the pointwise loss is binary cross-entropy: $$ \ell(y, \hat{y}) = - y \log(\hat{y}) - (1-y) \log(1-\hat{y}) $$
Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.
Answer:
We first denote the following: $$Z = \mat{W}_1 \vec{x}+ \vec{b}_1$$
Now, deriving by $x$:
$$ \frac{dL_s}{dx} = \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{d\hat{y}}\frac{d\hat{y}}{dx} = \frac{1}{N}\sum_{i=1}^{N}(-\frac{y_i}{\hat{y_i}} + \frac{1-y_i}{1-\hat{y_i}})\cdot W_2\cdot\frac{d\varphi}{dZ}\cdot W_1$$
Deriving by $b_1$:
$$ \frac{dL_s}{db_1} = \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{db_1}= \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{d\hat{y}}\frac{d\hat{y}}{db_1} = \frac{1}{N}\sum_{i=1}^{N}(-\frac{y_i}{\hat{y_i}} + \frac{1-y_i}{1-\hat{y_i}})\cdot W_2\cdot\frac{d\varphi}{dZ} $$
Deriving by $b_2$:
$$ \frac{dL_s}{db_2} = \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{db_2}= \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{d\hat{y}}\frac{d\hat{y}}{db_2} = \frac{1}{N}\sum_{i=1}^{N}(-\frac{y_i}{\hat{y_i}} + \frac{1-y_i}{1-\hat{y_i}}) $$
Deriving by $W_1$:
$$ \frac{dL_s}{dW_1} = \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{dW_1} + \lambda\mat{W}_1 = \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{d\hat{y}}\frac{d\hat{y}}{dW_1}+ \lambda\mat{W}_1= \newline
\frac{1}{N}\sum_{i=1}^{N}(-\frac{y_i}{\hat{y_i}} + \frac{1-y_i}{1-\hat{y_i}})\cdot W_2\frac{d\varphi}{dZ}x + \lambda\mat{W}_1 $$
Deriving by $W_2$:
$$ \frac{dL_s}{dW_2} = \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{dW_2} + \lambda\mat{W}_2 = \frac{1}{N}\sum_{i=1}^{N}\frac{d\ell}{d\hat{y}}\frac{d\hat{y}}{dW_2}+ \lambda\mat{W}_2 = \newline
\frac{1}{N}\sum_{i=1}^{N}(-\frac{y_i}{\hat{y_i}} + \frac{1-y_i}{1-\hat{y_i}})\cdot \varphi(Z) + \lambda\mat{W}_2 $$
The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is $$ f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}} $$
Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
What are the drawbacks of this approach? List at least two drawbacks compared to AD.
Answer:
To compute gradients numerically, we perturb each parameter by a small $\Delta\vec{x}$, re-evaluate the loss, and approximate the partial derivative by the difference quotient; repeating this for every parameter yields the full gradient without AD.
The resulting drawbacks:
1. Computational cost - it requires at least one extra forward pass per parameter, whereas AD obtains all gradients with a single backward pass; for models with millions of parameters this is prohibitively expensive.
2. Accuracy - finite differences only approximate the derivative: too large a $\Delta\vec{x}$ gives truncation error, while too small a $\Delta\vec{x}$ suffers from floating-point round-off, whereas AD is exact up to machine precision.
TODO:
- Calculate the gradients of the loss w.r.t. W and b using the approach of numerical gradients from the previous question.
- Verify using torch.allclose() that your numerical gradient is close to autograd's gradient.

import torch
N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)
def foo(W, b):
    return torch.mean(X @ W + b)
loss = foo(W, b)
print(f"{loss=}")
# TODO: Calculate gradients numerically for W and b
grad_W = torch.zeros_like(W)
grad_b = torch.zeros_like(b)
eps = 1e-6
f0 = foo(W, b)  # baseline value; foo is deterministic, so compute it once
for i in range(d):  # gradient w.r.t. b
    b_tag = b.clone()
    b_tag[i] += eps
    grad_b[i] = (foo(W, b_tag) - f0) / eps
for i in range(d):  # gradient w.r.t. W
    for j in range(d):
        w_tag = W.clone()
        w_tag[i, j] += eps
        grad_W[i, j] = (foo(w_tag, b) - f0) / eps
loss.backward()
# TODO: Compare with autograd using torch.allclose()
autograd_W = W.grad
autograd_b = b.grad
assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)
loss=tensor(1.5294, dtype=torch.float64, grad_fn=<MeanBackward0>)
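The cell above uses one-sided (forward) differences, whose error is O(eps). A central-difference variant, sketched below as a generic helper (the name numeric_grad is ours), reduces the error to O(eps^2) at the cost of two evaluations per parameter:

```python
import torch

def numeric_grad(f, t, eps=1e-6):
    # Central-difference estimate of d f() / d t.
    g = torch.zeros_like(t)
    with torch.no_grad():  # we mutate t in place, so disable autograd tracking
        flat, gflat = t.view(-1), g.view(-1)
        for i in range(flat.numel()):
            orig = flat[i].item()
            flat[i] = orig + eps
            f_plus = f().item()
            flat[i] = orig - eps
            f_minus = f().item()
            flat[i] = orig  # restore the original value
            gflat[i] = (f_plus - f_minus) / (2 * eps)
    return g

# Quick check on a tiny example: d/dt sum(t^2) = 2t
t = torch.rand(3, 3, dtype=torch.float64, requires_grad=True)
g = numeric_grad(lambda: (t ** 2).sum(), t)
```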
Answer:
- What will Y contain? Why this output shape?
- Implement nn.Embedding yourself using only torch tensors.

import torch.nn as nn
X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")
Y.shape=torch.Size([5, 6, 7, 8, 42000])
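Regarding implementing nn.Embedding yourself: an embedding lookup is just tensor indexing into a weight matrix. A minimal stand-in (with a smaller embedding_dim than above, for illustration):

```python
import torch

# A lookup table indexed by an integer tensor, mimicking nn.Embedding.
num_embeddings, embedding_dim = 42, 16
weight = torch.randn(num_embeddings, embedding_dim)

def embed(X):
    # Advanced indexing replaces each index in X with its weight row,
    # appending the embedding dimension to X's shape.
    return weight[X]

X = torch.randint(low=0, high=num_embeddings, size=(5, 6, 7, 8))
Y = embed(X)
print(Y.shape)  # torch.Size([5, 6, 7, 8, 16])
```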
Answer:
Y contains, for every integer index in X, the corresponding 42000-dimensional embedding vector (a row of the embedding's weight matrix). The lookup therefore appends the embedding dimension to X's shape, which is why Y.shape is (5, 6, 7, 8, 42000).
Answer:
3.4. False - The backpropagation algorithm itself remains the same; TBPTT only limits the calculation to consider a fixed number of timesteps. This technique is useful for dealing with vanishing gradients.
3.5. False - As stated, TBPTT limits the number of timesteps so that the number of derivatives required for a weight update is controlled. Limiting the length of the input sequences does not by itself truncate these steps; TBPTT is implemented by limiting the timesteps per run.
3.6. True - Since the algorithm truncates to S timesteps, during the forward pass and backpropagation only those steps need to be stored. For any new input, relations can only be found within the available previous S timesteps being considered in that run.
In tutorial 5 we learned how to use attention to perform alignment between a source and target sequence in machine translation.
Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
After learning that self-attention is gaining popularity thanks to the transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections, like in the tutorial..). What influence do you expect this will have on the learned hidden states?
Answer:
As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term. What would be the qualitative effect of this on:
Answer:
The KL-divergence term plays the role of a regularizer to avoid overfitting to the training set. Not including it will:
Answer:
Answer:
Answer:
The Intersection-Over-Union (IoU) is the area of overlap between the predicted segmentation and the ground truth, divided by the area of their union. It yields a numerical score indicating how close a predicted segment is to the ground truth, from 0 (no match) to 1 (perfect prediction).
Dice, on the other hand, is twice (2x) the area of overlap between the predicted segmentation and the ground truth, divided by the combined area of prediction and ground truth.
Dice can be used in similar circumstances to IoU, and the two are often reported together.
There is a subtle difference between them, though: the Dice score tends toward average performance, whereas IoU is more informative about worst-case performance.
So, in general, we can use IoU to determine for each segment whether the prediction is a TP, FP, or FN.
Afterwards we can build the precision-recall curve and use mAP to summarize it into a single value representing the average of the precisions over all segments.
Nowadays, using mAP makes more sense as it is a better representation of the model's quality than using F1 (Dice) to understand the imbalance between precision and recall across segments.
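A minimal sketch of both metrics on binary masks (the helper names are ours). Note the identity Dice = 2·IoU / (1 + IoU), which is why the two scores always move together:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    # area of overlap / area of union
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    # 2 * area of overlap / combined area
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2 * inter / total if total > 0 else 1.0

pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
gt   = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
print(iou(pred, gt))   # 2/4 = 0.5
print(dice(pred, gt))  # 2*2/(3+3) ~ 0.667
```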
Answer:
YOLO - a one-stage detector: a single pass through the network predicts all bounding boxes/areas.
Mask R-CNN - a two-stage detector: it first uses an RPN to generate regions of interest.
RPN outputs:
A Region Proposal Network (RPN) is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals: it takes an image as input and outputs a set of bounding-box proposals, each with a respective score.
YOLO outputs:
YOLO consists of 27 layers: 24 convolutional layers, two fully connected layers, and a final detection layer.
YOLO divides the input image into an N-by-N grid of cells and, during processing, predicts several bounding boxes per cell for the objects to be detected.
In this part you'll implement a small comparative-analysis project, heavily based on the materials from the tutorials and homework.
project/ directory. You can import these files here, as we do for the homeworks.

One of the prevailing approaches for improving training stability of GANs is to use a technique called Spectral Normalization to normalize the largest singular value of a weight matrix so that it equals 1.
This approach is generally applied to the discriminator's weights in order to stabilize training. The resulting model is sometimes referred to as a SN-GAN.
See Appendix A in the linked paper for the exact algorithm. You can also use pytorch's spectral_norm.
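As a sketch of how this looks in practice (a toy discriminator of our own, using pytorch's built-in `spectral_norm`): each weight-bearing layer is wrapped so that its weight is divided by an estimate of its largest singular value on every forward pass, one power iteration per pass.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Toy discriminator (our illustration, not the project architecture)
disc = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 5, stride=2, padding=2)),
    nn.ReLU(),
    nn.Flatten(),
    spectral_norm(nn.Linear(64 * 32 * 32, 1)),
)

x = torch.randn(4, 3, 64, 64)
for _ in range(50):          # repeated passes converge the power iteration
    out = disc(x)

# The normalized conv weight (flattened to a matrix) now has sigma_max ~ 1
W = disc[0].weight.detach().flatten(1)
sigma_max = torch.linalg.svdvals(W).max().item()
print(out.shape, sigma_max)
```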
Another very common improvement to the vanilla GAN is known as the Wasserstein GAN (WGAN). It uses a simple modification to the loss function, with strong theoretical justifications based on the Wasserstein (earth-mover's) distance. See the tutorial or here for a brief explanation of this loss function.
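The two Wasserstein loss terms are simple enough to state directly (function names are ours): the critic minimizes E[D(fake)] - E[D(real)], i.e. maximizes the gap, and the generator minimizes -E[D(fake)].

```python
import torch

def critic_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # minimizing this maximizes E[D(real)] - E[D(fake)]
    return d_fake.mean() - d_real.mean()

def generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # the generator tries to raise the critic's score on its samples
    return -d_fake.mean()

d_real = torch.tensor([0.9, 0.8])
d_fake = torch.tensor([0.1, 0.2])
print(critic_loss(d_real, d_fake).item())   # ~ -0.7
print(generator_loss(d_fake).item())        # ~ -0.15
```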
One problem with generative models for images is that it's difficult to objectively assess the quality of the resulting images. To also obtain a quantitative score for the images generated by each model, we'll use the Inception Score. This runs a pre-trained Inception CNN on the generated images and computes a score based on the predicted probability for each class. Although not a perfect proxy for subjective quality, it's commonly used as a way to compare generative models. You can use an implementation of this score that you find online, e.g. this one, or implement it yourself.
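Given a matrix of class probabilities p(y|x) (one row per generated image, as produced by the pretrained Inception network), the score is IS = exp(E_x[KL(p(y|x) || p(y))]). A hedged sketch of ours, not the linked implementation:

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    # probs: (n_images, n_classes) rows of p(y|x)
    p_y = probs.mean(axis=0, keepdims=True)   # marginal class distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Confident AND diverse predictions -> high score (2 classes: max is 2)
sharp = np.array([[1.0, 0.0], [0.0, 1.0]])
flat  = np.array([[0.5, 0.5], [0.5, 0.5]])
print(inception_score(sharp))  # ~ 2.0
print(inception_score(flat))   # 1.0
```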
You will gain a bonus if you also address Gradient Penalty, as we saw in the tutorial that it can improve the robustness of the GAN and essentially improve the results.
Based on the linked papers, add Spectral Normalization and the Wasserstein loss to your GAN from HW3. Compare between:
As a dataset, you can use LFW as in HW3 or CelebA, or even choose a custom dataset (note that there's a dataloader for CelebA in torchvision).
Your results should include:
TODO: This is where you should write your explanations and implement the code to display the results. See guidelines about what to include in this section.
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile
import tqdm
import pickle
import numpy as np
import torch
import matplotlib.pyplot as plt
from project.inception import *
from project.gan_models import *
import torch.optim as optim
from torch.utils.data import DataLoader
import project.gan_models as gan
from project.hyperparams import *
%load_ext autoreload
%autoreload 2
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda
We chose to keep using the same Bush dataset as in the VAE & GAN notebooks, so that we can conduct an 'eye test' comparing the project's results against those notebooks:
ds_gwb = load_bush_dataset()
File C:\Users\Gil zeevi\.pytorch-datasets\lfw-bush.zip exists, skipping download. Extracting C:\Users\Gil zeevi\.pytorch-datasets\lfw-bush.zip... Extracted 531 to C:\Users\Gil zeevi\.pytorch-datasets\lfw/George_W_Bush Found 530 images in dataset folder.
First, let's present the vanilla GAN with all of its components:
vanil_gan = load_vanilla_GAN(ds_gwb)
*** Chosen Discriminator Architecture for Vanilla GAN: ***
Discriminator(
(disc_cnn): Sequential(
(0): Conv2d(3, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
)
(disc_fc): Linear(in_features=16384, out_features=1, bias=True)
)
*** Chosen Generator Architecture for Vanilla GAN: ***
Generator(
(gen_fc): Linear(in_features=128, out_features=16384, bias=False)
(gen_cnn): Sequential(
(0): ConvTranspose2d(1024, 512, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): LeakyReLU(negative_slope=0.2)
(3): ConvTranspose2d(512, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): LeakyReLU(negative_slope=0.2)
(6): ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): LeakyReLU(negative_slope=0.2)
(9): ConvTranspose2d(128, 3, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(10): Tanh()
)
)
*** Chosen HyperParameters for Vanilla GAN: ***
{'batch_size': 32, 'z_dim': 8, 'data_label': 1, 'label_noise': 0.1, 'discriminator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.5, 0.999)}, 'generator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.3, 0.999)}}
The following samples from the vanilla GAN were reproduced with the GANTrainer.ipynb notebook. The load_vanilla_GAN() function loads a checkpoint that was saved in Colab with the following hyperparameters:
{'batch_size': 32, 'z_dim': 8, 'data_label': 1, 'label_noise': 0.1, 'discriminator_optimizer':
{'type': 'Adam', 'lr': 0.0002, 'betas': (0.5, 0.999)}, 'generator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.3, 0.999)}}
Now, for the SN-GAN we simply apply torch.nn.utils.spectral_norm to each of our discriminator's modules and reuse the same hyperparameters as trained for the vanilla GAN, to see the differences.
We trained the Python file 'SNgan.py' via the Part2_GAN.ipynb notebook to get the following results:
SN_gan = load_SN_GAN(ds_gwb, SN_act=True)
*** Chosen Discriminator Architecture for SNGAN: ***
Discriminator(
(disc_cnn): Sequential(
(0): Conv2d(3, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
)
(disc_fc): Linear(in_features=16384, out_features=1, bias=True)
)
*** Chosen Generator Architecture for SNGAN: ***
Generator(
(gen_fc): Linear(in_features=128, out_features=16384, bias=False)
(gen_cnn): Sequential(
(0): ConvTranspose2d(1024, 512, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): LeakyReLU(negative_slope=0.2)
(3): ConvTranspose2d(512, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): LeakyReLU(negative_slope=0.2)
(6): ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): LeakyReLU(negative_slope=0.2)
(9): ConvTranspose2d(128, 3, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(10): Tanh()
)
)
*** Chosen HyperParameters for SNGAN: ***
{'batch_size': 32, 'z_dim': 8, 'data_label': 1, 'label_noise': 0.1, 'discriminator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.5, 0.999)}, 'generator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.3, 0.999)}}
The following samples from the SN-GAN were reproduced with the GANTrainer.ipynb notebook. The load_SN_GAN() function loads a checkpoint that was saved in Colab with the following hyperparameters:
{'batch_size': 32, 'z_dim': 8, 'data_label': 1, 'label_noise': 0.1, 'discriminator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.5, 0.999)}, 'generator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.3, 0.999)}}
To conclude, the visual ('eye test') comparison between the vanilla GAN and the vanilla GAN with SN samples is the following:
The figure below was produced by running compare_imgs(device), which reloads the checkpoints used above.
We'll present the learning process: for each epoch we show the resulting inception score, calculated from 1000 samples of each model's generator.
We present two kinds of graphs:
1) As-Is plot - shows the raw inception score over the learning process.
2) Trend plot - shows a polynomial-spline trend of the inception score, in order to see its trend during training and assess the best inception scores throughout training.
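The 'Trend plot' smoothing can be sketched as follows (an assumption of ours; we use a plain polynomial fit via np.polyfit as a stand-in for the spline smoothing, on a toy score curve):

```python
import numpy as np

epochs = np.arange(100)
# toy inception-score curve: saturating growth plus oscillation
scores = 2.0 + 0.8 * (1 - np.exp(-epochs / 30)) + 0.1 * np.sin(epochs)

coeffs = np.polyfit(epochs, scores, deg=5)   # low-degree polynomial trend
trend = np.polyval(coeffs, epochs)           # smoothed curve to plot
print(trend.shape)
```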
# Load training pickles containing per-epoch inception scores
names = (['gan_32_8_0.1', 'gan_sn_32_8_0.1']
         + ['wgan_32_8_' + str(i) for i in [1, 2, 5, 10, 20]]
         + ['wgan_sn_32_8_' + str(i) for i in [1, 2, 5, 10, 20]])
models = {'gan': [], 'gan_sn': [], 'wgan': [], 'wgan_sn': []}
nums = {0: 1, 1: 2, 2: 5, 3: 10, 4: 20}
for name in names:
    with open(f'project/{name}.pickle', 'rb') as handle:
        score = pickle.load(handle)['score']
    tmp = name.split('_')
    if tmp[1] == 'sn':
        models[f'{tmp[0]}_sn'].append(score)
    else:
        models[tmp[0]].append(score)
plot_inception(models, ['gan', 'gan_sn'], epochs=100, k=5)
print(f"\nGan maximal inception score = {max(models['gan'][0]):.3f}\n")
print(f"SN_Gan maximal inception score = {max(models['gan_sn'][0]):.3f}")
Gan maximal inception score = 2.685 SN_Gan maximal inception score = 2.863
We can see that the spectrally normalized GAN model yielded a slightly better inception-score trend and a higher maximal score.
Now, we'll present our WGAN findings.
Our WGAN was based on the same CNN architecture as the vanilla GAN, which is printed above, so we'll skip reprinting it.
We will focus on tuning the best n_critic parameter using the inception-score metric; after that, we'll present a sample of the images produced by the best chosen model.
note:
plot_inception(models, ['wgan'], epochs=100, k=5)
for i in range(5):
    print(f"\nWGan with n_critic of {nums[i]} maximal inception score = {max(models['wgan'][i]):.3f}\n")
WGan with n_critic of 1 maximal inception score = 3.305 WGan with n_critic of 2 maximal inception score = 3.041 WGan with n_critic of 5 maximal inception score = 3.039 WGan with n_critic of 10 maximal inception score = 2.898 WGan with n_critic of 20 maximal inception score = 2.954
We see from these results that even though WGAN with n_critic of 1 yielded the maximal inception score, WGAN with n_critic of 5 had the highest inception-score trend; thus, in general, we can say that during the learning process its generator produces better samples, based on the inception metric of course.
We also noted that n_critic of 2 presented good IS results until the 50th epoch, then became more unstable and produced worse IS scores than n_critic of 5, so we decided to choose n_critic of 5.
WGAN samples with $n_{critic} = 5$:
The following samples from the WGAN were reproduced with the GANTrainer.ipynb notebook. The load_WGAN() function loads a checkpoint that was saved in Colab with the following hyperparameters:
{'batch_size': 32, 'z_dim': 8, 'discriminator_optimizer': {'type': 'RMSprop', 'lr': 0.0005}, 'generator_optimizer': {'type': 'RMSprop', 'lr': 0.0005}, 'n_critic': 5, 'c': 0.01}
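A toy sketch of how the n_critic and c hyperparameters drive one WGAN iteration (the linear critic/generator and loop structure are our illustration, not the project code): the critic takes n_critic optimization steps, with weights clipped to [-c, c] after each, per single generator step.

```python
import torch
import torch.nn as nn

n_critic, c = 5, 0.01
critic, gen = nn.Linear(8, 1), nn.Linear(4, 8)   # toy stand-ins
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-4)
opt_g = torch.optim.RMSprop(gen.parameters(), lr=5e-4)

real = torch.randn(32, 8)
for _ in range(n_critic):                         # n_critic critic updates
    fake = gen(torch.randn(32, 4)).detach()
    loss_c = critic(fake).mean() - critic(real).mean()
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    with torch.no_grad():                         # crude Lipschitz constraint
        for p in critic.parameters():
            p.clamp_(-c, c)

loss_g = -critic(gen(torch.randn(32, 4))).mean()  # one generator step
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```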
Now, we'll present the same process for WGAN with spectral norm.
plot_inception(models, ['wgan_sn'], k=3)
for i in range(5):
    print(f"\nWGan with n_critic of {nums[i]} maximal inception score = {max(models['wgan_sn'][i]):.3f}\n")
WGan with n_critic of 1 maximal inception score = 3.080 WGan with n_critic of 2 maximal inception score = 2.998 WGan with n_critic of 5 maximal inception score = 2.951 WGan with n_critic of 10 maximal inception score = 3.075 WGan with n_critic of 20 maximal inception score = 2.984
Here we are actually uncertain which n_critic parameter would produce the highest score, because n_critic of 20 resulted in a high peak trend, while n_critic of 5 showed a stable, monotonically increasing trend.
So, we will inspect both!
The following samples from the WGAN_SN were reproduced with the GANTrainer.ipynb notebook. The load_WGAN() function loads a checkpoint that was saved in Colab with the following hyperparameters:
{'batch_size': 32, 'z_dim': 8, 'discriminator_optimizer': {'type': 'RMSprop', 'lr': 0.0005}, 'generator_optimizer': {'type': 'RMSprop', 'lr': 0.0005}, 'n_critic': 5, 'c': 0.01}
The following samples from the WGAN_SN were reproduced with the GANTrainer.ipynb notebook. The load_WGAN() function loads a checkpoint that was saved in Colab with the following hyperparameters:
{'batch_size': 32, 'z_dim': 8, 'discriminator_optimizer': {'type': 'RMSprop', 'lr': 0.0005}, 'generator_optimizer': {'type': 'RMSprop', 'lr': 0.0005}, 'n_critic': 20, 'c': 0.01}
It is actually quite difficult to decide which $n_{critic}$ is better, so we'll stick to $n_{critic} = 5$ for consistency when comparing WGAN to WGAN_SN.
Let's present more samples of WGAN with SN and $n_{critic} = 5$, just for good measure:

comp_models = {'gan': models['gan'],
               'gan_sn': models['gan_sn'],
               'wgan': [models['wgan'][2]],
               'wgan_sn': [models['wgan_sn'][2]]}
plot_inception(comp_models, ['gan', 'gan_sn', 'wgan', 'wgan_sn'], nums={0: 5}, nums_sn={0: 5}, k=5)
for model in comp_models:
    print(f"\n{model.upper()} maximal inception score = {max(comp_models[model][0]):.3f}\n")
GAN maximal inception score = 2.685 GAN_SN maximal inception score = 2.863 WGAN maximal inception score = 3.039 WGAN_SN maximal inception score = 2.951
We tried to determine the best $n_{critic}$ for WGAN using the inception score. While the maximal score of $n_{critic} = 1$ was higher than that of $n_{critic} = 5$, overall the trend of $n_{critic} = 5$ showed higher results, which also fits what publications claim.
With WGAN_SN we were uncertain which $n_{critic}$ hyperparameter outperforms the others. We spotted that $n_{critic} = 20$ had the best trend, but $n_{critic} = 5$ had a monotonically increasing trend, which is also what we would like to see from an IS curve throughout training. We decided to train both of these models, and the sampled pictures were indecisive: we couldn't confidently say which model's $n_{critic}$ performed better. Hence, we picked $n_{critic} = 5$ for both WGAN and WGAN_SN, for consistency and to see what SN contributes in the same type of model.